LLM Benchmarks

Large language models compared


How do you tell which LLMs are “better”? Numerous benchmarks are under development.

Prometheus-Eval: a family of open-source language models for evaluating other LLMs.
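The idea behind judge models like Prometheus is simple: give one model a rubric plus another model's answer and ask it for a score. The sketch below only illustrates that generic LLM-as-judge pattern; it uses the OpenAI Python client as a stand-in judge and a hypothetical judge() helper, not Prometheus-Eval's own models or prompt templates.

```python
# Generic LLM-as-judge sketch (illustrative; not the Prometheus-Eval API).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Rubric: {rubric}
Question: {question}
Answer to evaluate: {answer}
Reply with a single integer score from 1 (poor) to 5 (excellent)."""

def judge(question: str, answer: str, rubric: str) -> int:
    """Ask the judge model for a 1-5 score of `answer` against `rubric`."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # any capable judge model
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            rubric=rubric, question=question, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())

score = judge(
    question="What causes tides?",
    answer="Mostly the Moon's gravity, with a smaller contribution from the Sun.",
    rubric="Factual accuracy and completeness of the explanation.",
)
print(score)
```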

LMSYS Chatbot Arena Leaderboard

Contribute your vote 🗳️ at chat.lmsys.org! Find more analysis in the notebook.

Ranking as of Apr 8, 2024

| Rank | 🤖 Model | ⭐ Arena Elo | 📊 95% CI | 🗳️ Votes | Organization | License | Knowledge Cutoff |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4-1106-preview | 1252 | +3/-3 | 56936 | OpenAI | Proprietary | 2023/4 |
| 1 | GPT-4-0125-preview | 1249 | +3/-4 | 38105 | OpenAI | Proprietary | 2023/12 |
| 4 | Bard (Gemini Pro) | 1204 | +5/-5 | 12468 | Google | Proprietary | Online |
| 4 | Claude 3 Sonnet | 1200 | +3/-4 | 40389 | Anthropic | Proprietary | 2023/8 |
| 10 | Command R | 1146 | +5/-6 | 12739 | Cohere | CC-BY-NC-4.0 | 2024/3 |
| 14 | Gemini Pro (Dev API) | 1127 | +4/-4 | 16041 | Google | Proprietary | 2023/4 |
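The "Arena Elo" column comes from pairwise human votes: two anonymous models answer the same prompt and the user picks a winner. LMSYS's published pipeline is more sophisticated (a Bradley-Terry fit with bootstrapped confidence intervals), but the classic online Elo update below conveys the intuition; the constants and example votes are illustrative only.

```python
# Rough sketch of turning pairwise "A vs. B" votes into Arena-style Elo scores.
from collections import defaultdict

K = 4  # small update step, since each model accumulates thousands of votes
ratings: dict[str, float] = defaultdict(lambda: 1000.0)

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after a human picks 'A', 'B', or 'tie'."""
    e_a = expected(ratings[model_a], ratings[model_b])
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Example: a handful of votes
for a, b, w in [("gpt-4", "claude-3-sonnet", "A"),
                ("gpt-4", "gemini-pro", "A"),
                ("claude-3-sonnet", "gemini-pro", "tie")]:
    record_vote(a, b, w)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```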

Anthropic released its Claude 3 model and says it beats its competitors

Claude 3 performance compared

Benchmarking

Strange Loop Canon explains why it’s hard to evaluate LLMs: the same reason it’s hard to make a general-purpose way to evaluate humans.

Andrej Karpathy, a co-founder of OpenAI who is no longer at the company, agrees:

People should be extremely careful with evaluation comparisons, not only because the evals themselves are worse than you think, but also because many of them are getting overfit in undefined ways, and also because the comparisons made are frankly misleading.

AgentBench is an academic attempt to benchmark LLMs against each other as agents.

Liu et al. (2023)

AgentBench [is] a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) …

The result: OpenAI’s GPT-4 achieved the highest overall score of 4.41 and led in almost every discipline; only in the web shopping task did GPT-3.5 come out on top.

The Claude model from competitor Anthropic follows closely with an overall score of 2.77, ahead of OpenAI’s free GPT-3.5 Turbo model. The average score of the commercial models is 2.24.

See The Decoder ($).
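For intuition about what “LLM-as-Agent in a multi-turn setting” means in practice, here is a hypothetical sketch of an agent evaluation loop, not AgentBench’s actual code or environments: the model proposes an action, a toy environment returns an observation, and success is judged from the final state.

```python
# Hypothetical multi-turn agent evaluation loop, in the spirit of LLM-as-Agent
# benchmarks like AgentBench (illustrative toy environment, not the real one).
from dataclasses import dataclass, field

@dataclass
class GuessNumberEnv:
    """Toy environment: the agent must guess a hidden integer in few turns."""
    target: int = 37
    max_turns: int = 8
    history: list[str] = field(default_factory=list)

    def step(self, action: str) -> tuple[str, bool]:
        """Apply the agent's action, return (observation, done)."""
        try:
            guess = int(action.strip())
        except ValueError:
            return "Please reply with an integer.", False
        if guess == self.target:
            return "Correct!", True
        return ("Too low." if guess < self.target else "Too high."), False

def run_episode(agent, env) -> bool:
    """Run one episode; return True if the agent solved the task in time."""
    observation = "Guess an integer between 1 and 100."
    for _ in range(env.max_turns):
        action = agent(observation, env.history)   # the LLM call would go here
        env.history.append(f"agent: {action}")
        observation, done = env.step(action)
        env.history.append(f"env: {observation}")
        if done:
            return True
    return False

def scripted_agent(observation: str, history: list[str]) -> str:
    """Trivial scripted stand-in for an LLM: binary search over the range."""
    lo, hi = 1, 100
    for line in history:
        if line.startswith("agent:"):
            last = int(line.split(":")[1])
        elif "Too low" in line:
            lo = last + 1
        elif "Too high" in line:
            hi = last - 1
    return str((lo + hi) // 2)

print(run_episode(scripted_agent, GuessNumberEnv()))  # True within 8 turns
```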

Comparing AI models

Chris Coulson compares Azure and OpenAI API speeds, finding that Azure is approximately 8x faster with GPT-3.5-Turbo and 3x faster with GPT-4 than OpenAI.

For the most part you don’t have to worry about price differences between Azure and OpenAI, as the price is the same for all of their standard features. Prices are also quite affordable, especially GPT-3.5-Turbo, as running several rounds of testing only cost me a few cents.
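Results like Coulson’s are easy to sanity-check yourself. A minimal sketch of timing the same prompt against the standard OpenAI endpoint and an Azure OpenAI deployment follows; the Azure endpoint, API version, key, and deployment name are placeholders you would substitute with your own.

```python
# Minimal latency comparison: the same prompt against OpenAI and Azure OpenAI.
# Endpoint, API version, key, and deployment name below are placeholders.
import time
from statistics import median
from openai import OpenAI, AzureOpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set
azure_client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_version="2024-02-01",
    api_key="YOUR-AZURE-KEY",
)

PROMPT = [{"role": "user",
           "content": "Summarize the plot of Hamlet in two sentences."}]

def time_completions(client, model: str, n: int = 5) -> float:
    """Return the median wall-clock seconds over n chat completions."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        client.chat.completions.create(model=model, messages=PROMPT, max_tokens=128)
        timings.append(time.perf_counter() - start)
    return median(timings)

print("OpenAI gpt-3.5-turbo:", time_completions(openai_client, "gpt-3.5-turbo"))
print("Azure  gpt-35-turbo :", time_completions(azure_client, "YOUR-DEPLOYMENT-NAME"))
```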

Nicholas Carlini has a benchmark for large language models (GitHub) built around a simple domain-specific language that makes it easy to add to the current 100+ tests; most of them evaluate coding ability (a sketch of the DSL idea follows the list below). Pass rates:

  • GPT-4: 49% passed
  • GPT-3.5: 30% passed
  • Claude 2.1: 31% passed
  • Claude Instant 1.2: 23% passed
  • Mistral Medium: 25% passed
  • Mistral Small: 21% passed
  • Gemini Pro: 21% passed

A complete evaluation grid is available here.
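To make the DSL idea concrete, here is a hypothetical sketch of the pattern, not Carlini’s actual classes: each stage is a small object and `>>` chains them into a test pipeline that prompts a model, runs the generated code, and checks the result.

```python
# Hypothetical pipe-style test DSL for LLM benchmarks (names are illustrative,
# not the actual classes in Carlini's repository).
class Stage:
    """A pipeline stage; `>>` composes stages left to right."""
    def __rshift__(self, other: "Stage") -> "Pipeline":
        return Pipeline([self, other])
    def run(self, value):
        raise NotImplementedError

class Pipeline(Stage):
    def __init__(self, stages):
        self.stages = stages
    def __rshift__(self, other: Stage) -> "Pipeline":
        return Pipeline(self.stages + [other])
    def run(self, value=None):
        for stage in self.stages:
            value = stage.run(value)
        return value

class Prompt(Stage):
    def __init__(self, text):
        self.text = text
    def run(self, _):
        return self.text

class LLMRun(Stage):
    """Send the prompt to a model; stubbed out with canned output here."""
    def run(self, prompt):
        return "def add(a, b):\n    return a + b  # pretend model output"

class PythonRun(Stage):
    """Execute the generated code and return its namespace."""
    def run(self, code):
        namespace = {}
        exec(code, namespace)  # real harnesses sandbox this step
        return namespace

class AssertTrue(Stage):
    def __init__(self, check):
        self.check = check
    def run(self, namespace):
        return self.check(namespace)

test = (Prompt("Write a Python function add(a, b) that returns the sum.")
        >> LLMRun() >> PythonRun()
        >> AssertTrue(lambda ns: ns["add"](2, 3) == 5))
print("PASS" if test.run() else "FAIL")
```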

Hugging Face has added four new leaderboards measuring language models’ accuracy in answering questions relevant to businesses (finance, law, etc.), safety and security, freedom from hallucinations, and ability to solve reasoning problems. Unfortunately, the leaderboards only evaluate open-source models.

Hallucination

AI Hallucination Rate (via @bindureddy)

DIY Benchmarking

^17b72a

A GitHub project that evaluates LLMs in real time by having them play Street Fighter III against each other.

#coding

See Also

China LLM Benchmarks: Chinese LLMs are still far behind

References

Liu, Xiao, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, et al. 2023. “AgentBench: Evaluating LLMs as Agents.” https://doi.org/10.48550/ARXIV.2308.03688.